Comments for MEDB 5501, Week 6

Assessing normality

  • Problems caused by non-normality
    • Poor confidence intervals, hypothesis tests
      • Too much imprecision
      • Poor coverage probability
        • Especially for one-tailed tests
    • Inability to extrapolate
  • What about the Central Limit Theorem?

How to handle non-normality

  • Ignore it
    • Central Limit Theorem
  • Transform your data
  • Use alternatives
    • Nonparametric tests (covered in a later module)
    • Bootstrap (covered in a later module)
    • Randomization tests (not covered in this class)

Approaches to examine normality

  • Histogram, boxplot
  • Normal probability plot
    • Normal correlation (not covered, not recommended)
  • Skewness, Kurtosis
  • Kolmogorov-Smirnov test (not recommended)
  • Shapiro-Wilk test (not recommended)
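
Skewness and kurtosis can be computed directly from the moments of the data. The sketch below is one simple, unadjusted version using only the Python standard library. The function names and the two small data sets are made up for illustration, and statistical packages often apply small-sample adjustments, so their values may differ slightly.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for right skew."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 20]   # hypothetical data with a heavy right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A positive skewness points to a heavier right tail, and a negative excess kurtosis points to lighter tails than the normal distribution.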

Never use p-values to test normality

  • H0: Data comes from a normal population
  • Why is this bad?
    • ASA recommendation against the use of p-values
    • Too little power for small sample sizes
    • Too much power for large sample sizes
    • Ignores the type of non-normality

Normal histogram with n=100

Normal histogram with n=1,000

Normal histogram with n=10,000

Normal histogram with n=100,000

Normal histogram with n=1,000,000

Normal distribution (n=infinity)

Normal(2, 1)

Normal(-1, 1)

Normal(0, 2)

Normal(0, 0.5)

Skewed right

Right skewness is characterized by the tails of the distribution

  • Heavy right tail
    • Greater tendency to produce extreme values on the right
  • Light left tail
    • Lesser tendency to produce extreme values on the left
  • Right skewness is the most common type of non-normality

Normal probability plot

  • Compare data to evenly spaced percentiles of the normal distribution
  • Example with n=4
    • Compare smallest value with \(Z_{0.2}\)
    • Compare next value with \(Z_{0.4}\)
    • Compare next value with \(Z_{0.6}\)
    • Compare largest value with \(Z_{0.8}\)
  • No best definition for evenly spaced
    • The 12.5th, 37.5th, 62.5th, and 87.5th percentiles, for example
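
The two choices of evenly spaced percentiles for n=4 can be reproduced with a short sketch. The function names and the `offset` parameter are assumptions made for illustration. An offset of 0 gives i/(n+1), and an offset of 0.5 gives (i-0.5)/n. `NormalDist().inv_cdf` from the Python standard library converts each probability to the matching percentile of the standard normal distribution.

```python
from statistics import NormalDist

def plotting_positions(n, offset=0.0):
    """Evenly spaced probabilities (i - offset) / (n + 1 - 2 * offset), i = 1..n."""
    return [(i - offset) / (n + 1 - 2 * offset) for i in range(1, n + 1)]

def normal_scores(n, offset=0.0):
    """Standard normal percentiles to plot against the sorted data."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(p) for p in plotting_positions(n, offset)]

print(plotting_positions(4))        # 0.2, 0.4, 0.6, 0.8
print(plotting_positions(4, 0.5))   # 0.125, 0.375, 0.625, 0.875

data = sorted([7.1, 3.4, 5.0, 4.2])  # hypothetical sample of size 4
for z, value in zip(normal_scores(len(data)), data):
    print(round(z, 3), value)        # pairs to place on the plot
```

If the data are close to normal, the plotted pairs should fall close to a straight line.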

Interpreting, heavy left tail

Interpreting, light left tail

Interpreting, heavy right tail

Interpreting, light right tail

Right skewed data

Left skewed data

Heavy tailed data

Light tailed data

Bimodal data

Normal data

Log transformation

  • If \(a^b=c\), then \(\log_a(c)=b\)
  • \(\log(a \times b)=\log(a)+\log(b)\)
  • Three commonly used bases
    • log base 10: (\(\log_{10}\))
    • log base 2: (\(\log_2\))
    • natural log, log base e: (\(\ln\))
  • Important! SPSS uses lg10, not log10.
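
The rules above can be checked numerically. This is a small sketch in Python, not SPSS syntax, and the numbers are arbitrary. `math.log10`, `math.log2`, and `math.log` are the standard library functions for the three bases.

```python
import math

# If a**b == c, then log base a of c equals b.
print(math.log2(8))       # 2**3 == 8
print(math.log10(1000))   # 10**3 == 1000

# The log of a product is the sum of the logs (true in any base).
a, b = 8.0, 4.0
print(math.log(a * b), math.log(a) + math.log(b))

# Different bases differ only by a constant multiplier.
x = 50.0
print(math.log10(x), math.log(x) / math.log(10))
```

Because the bases differ only by a constant multiplier, the choice of base changes the units of the transformed data but not the shape of its distribution.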

Where the log transformation is routine

  • Log units are common in science
    • Richter scale (1 unit equals a 10-fold change in amplitude)
    • Decibel (20 units equals a 10-fold change in amplitude)

Why use the log function

  • Stretches small values
  • Squeezes large values
  • Possible benefits
    • Removing skew
    • Eliminating outliers
    • Stabilizing variation
    • Model simplification
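
A small sketch of the stretching and squeezing, using made-up values where each one is ten times the one before. On the original scale the gaps grow from tiny to huge. On the log scale every gap is the same size.

```python
import math

values = [0.01, 0.1, 1, 10, 100]   # hypothetical, each 10 times the previous
logged = [math.log10(v) for v in values]

# Gaps between neighboring values on each scale.
raw_gaps = [hi - lo for lo, hi in zip(values, values[1:])]
log_gaps = [hi - lo for lo, hi in zip(logged, logged[1:])]

print(raw_gaps)   # small gaps at the low end, large gaps at the high end
print(log_gaps)   # equal gaps after the log transformation
```

The small values (0.01 to 0.1) are pulled apart and the large values (10 to 100) are pulled together, which is why a long right tail tends to shrink on the log scale.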

When to use a log transformation

  • Data bounded below by zero
  • Data defined as a ratio
  • Max / Min > 3

Squeezing

Stretching

Skewness

Outliers

Unequal variation

Multiplicative models

  • Additive model
    • Catalogs +1,000 causes sales +$5,000
  • Multiplicative model
    • Rain + 1 inch causes pollen * 0.5
  • Multiplicative models are tricky
  • Log converts it to an additive model
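
The pollen example can be sketched numerically. The starting pollen count of 800 is a made-up value. The halving per inch of rain comes from the example above. Taking log base 2 turns "multiply by 0.5" into "subtract 1".

```python
import math

start_pollen = 800.0   # hypothetical pollen count with no rain
rain_inches = [0, 1, 2, 3]

# Multiplicative model: each extra inch of rain multiplies pollen by 0.5.
pollen = [start_pollen * 0.5 ** r for r in rain_inches]

# On the log2 scale the same model is additive: each inch subtracts 1.
log2_pollen = [math.log2(p) for p in pollen]
steps = [after - before for before, after in zip(log2_pollen, log2_pollen[1:])]

print(pollen)   # 800, 400, 200, 100
print(steps)    # a constant change per inch of rain
```

A constant change per unit of X is exactly what a straight line describes, so the logged data fit an additive model.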

Example: Metabolic ratio

Skewness

Outliers

Unequal variation

Further reading

  • Keene ON. The log transformation is special. Stat Med 1995; 14(8): 811-9. Article is behind a paywall.
  • Wikipedia. Log-normal distribution. Available in HTML format.

Algebra formula for a straight line

  • \(y=mx+b\)
  • \(m = \Delta y / \Delta x\)
  • m = slope
  • b = y-intercept

Linear regression interpretation of a straight line

  • The slope represents the estimated average change in Y when X increases by one unit.

  • The intercept represents the estimated average value of Y when X equals zero.
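
A minimal sketch of how the slope and intercept are estimated, using the usual least squares formulas and made-up data. This is not the SPSS procedure itself, and the function name `fit_line` is an assumption for illustration.

```python
def fit_line(x, y):
    """Least squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                      # change in Y per one unit of X
    intercept = mean_y - slope * mean_x    # estimated Y when X equals zero
    return slope, intercept

x = [0, 1, 2, 3, 4]                 # hypothetical predictor values
y = [3.1, 4.9, 7.2, 8.8, 11.0]      # hypothetical outcome values

slope, intercept = fit_line(x, y)
print(round(slope, 2), round(intercept, 2))
```

With these numbers, the estimated average value of Y goes up by about 1.97 units when X increases by one unit, and the estimated average value of Y is about 3.06 when X equals zero.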

First regression example with interpretation

Output from SPSS

Interpretation when X is categorical

  • Code X as 0-1
  • Intercept: Estimated average value of Y for the “0” category.
  • Slope: Estimated average change in Y when category changes from “0” to “1”.
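
With a 0-1 coded X, the least squares line passes through the two group means. The sketch below uses made-up data and a hypothetical helper `fit_line` to show that the intercept equals the mean of the "0" category and the slope equals the difference between the two means. Swapping which category is coded "0" flips the sign of the slope and moves the intercept to the other group mean.

```python
def fit_line(x, y):
    """Least squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

group = [0, 0, 0, 1, 1, 1]          # hypothetical 0-1 coding of two categories
y = [10, 12, 14, 15, 17, 19]        # group means are 12 and 17

slope, intercept = fit_line(group, y)
print(intercept, slope)             # mean of "0" group, difference in means

# Reverse the coding so the other category becomes "0".
flipped = [1 - g for g in group]
flipped_slope, flipped_intercept = fit_line(flipped, y)
print(flipped_intercept, flipped_slope)
```

Either coding describes the same two group means. Only the reference category and the sign of the slope change.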

Second regression example with interpretation

SPSS output

Alternate coding

Figure 5. Scatterplot with alternate ordering of treatment

SPSS output